Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled #15825

Open
herley-shaori wants to merge 4 commits into apache:main from herley-shaori:fix/15812-cjk-bigram-position-inconsistency
@herley-shaori

Summary

Fixes #15812

CJKBigramFilter produces different token positions for the same input depending on whether outputUnigrams is true or false. This breaks phrase queries when index-time and search-time analyzers use different outputUnigrams settings — a common optimization pattern for CJK search.

Root cause

In flushBigram(), when outputUnigrams=false, bigrams are emitted with the default positionIncrement=1, but a bigram conceptually spans two character positions. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens are assigned positions that are off by 1 compared to the outputUnigrams=true case.

Example with input "一二、三":

outputUnigrams=true:  一(pos0) 一二(pos0) 二(pos1) 三(pos2)
outputUnigrams=false: 一二(pos0) 三(pos1) ← should be pos2
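
The mismatch in the example can be reproduced with a small stand-alone model. The sketch below is not Lucene code: `BigramPositionModel` and its methods are hypothetical names, the model only treats Han characters as CJK, and it assumes any other character is a word break that consumes no position. The input is the example string above, written with Unicode escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of bigram positions; this is not Lucene's CJKBigramFilter. */
final class BigramPositionModel {

  /** Splits the input into runs of Han characters; everything else is treated as a word break. */
  static List<String> segments(String input) {
    List<String> out = new ArrayList<>();
    StringBuilder cur = new StringBuilder();
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
        cur.appendCodePoint(cp);
      } else if (cur.length() > 0) {
        out.add(cur.toString());
        cur.setLength(0);
      }
    }
    if (cur.length() > 0) out.add(cur.toString());
    return out;
  }

  /** Unigram mode: every character owns one position; a bigram shares its first character's position. */
  static List<String> withUnigrams(String input) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    for (String seg : segments(input)) {
      int[] cps = seg.codePoints().toArray();
      for (int i = 0; i < cps.length; i++) {
        out.add(new String(cps, i, 1) + "@" + (pos + i));
        if (i + 1 < cps.length) out.add(new String(cps, i, 2) + "@" + (pos + i));
      }
      pos += cps.length;
    }
    return out;
  }

  /** Bigram-only mode as described for the pre-fix behavior: every emitted token advances by 1. */
  static List<String> bigramsOnlyBeforeFix(String input) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    for (String seg : segments(input)) {
      int[] cps = seg.codePoints().toArray();
      if (cps.length == 1) {
        out.add(seg + "@" + pos++); // a lone character is emitted as a unigram
      } else {
        for (int i = 0; i + 1 < cps.length; i++) out.add(new String(cps, i, 2) + "@" + pos++);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String input = "\u4e00\u4e8c\u3001\u4e09"; // the example input from the description
    System.out.println("outputUnigrams=true : " + withUnigrams(input));
    System.out.println("outputUnigrams=false: " + bigramsOnlyBeforeFix(input));
  }
}
```

For the input 一二、三 the model yields [一@0, 一二@0, 二@1, 三@2] in unigram mode and [一二@0, 三@1] in the pre-fix bigram-only mode. The gap between 一二 and 三 is 2 in one stream and 1 in the other, which is why a phrase query analyzed with one setting cannot line up with an index built with the other.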

Fix

Following the principle suggested by @rmuir, that outputUnigrams=false should behave as if unigrams were emitted and then later removed, this PR tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary.

Two new fields in CJKBigramFilter:

  • hadBigrams: set true when a bigram is flushed in no-unigram mode
  • deferredPosInc: accumulated extra position increment, applied at the next segment transition (unaligned offsets, non-CJK token, or end of stream)

The deferred increment is applied in flushBigram(), flushUnigram(), and the non-CJK passthrough path in incrementToken().
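
The deferral idea can be sketched in isolation. The following is a simplified model under stated assumptions, not the code in this PR: `DeferredPosIncModel` is a hypothetical class, the input is assumed to be already split into CJK segments at word breaks, non-CJK tokens and the end-of-stream case are left out, and only the names `hadBigrams` and `deferredPosInc` are borrowed from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the deferred position increment; not the actual CJKBigramFilter patch. */
final class DeferredPosIncModel {

  /**
   * Emits bigram-only tokens for pre-split CJK segments, formatted as "text@position".
   * A segment that produced bigrams leaves one extra increment for the next segment's first token.
   */
  static List<String> bigramsOnly(String... segments) {
    List<String> out = new ArrayList<>();
    int pos = -1;           // position of the most recently emitted token
    int deferredPosInc = 0; // extra increment owed to the next token after a boundary
    for (String seg : segments) {
      int[] cps = seg.codePoints().toArray();
      if (cps.length == 0) continue;
      boolean hadBigrams = false;
      int tokenCount = Math.max(1, cps.length - 1); // a lone character is emitted as a unigram
      for (int i = 0; i < tokenCount; i++) {
        pos += 1 + deferredPosInc; // default increment plus anything deferred
        deferredPosInc = 0;
        int len = Math.min(2, cps.length);
        out.add(new String(cps, i, len) + "@" + pos);
        hadBigrams |= (len == 2);
      }
      // The last bigram sits one position before the segment's final character,
      // so the token after the boundary must skip that character's position.
      if (hadBigrams) deferredPosInc++;
    }
    return out;
  }

  public static void main(String[] args) {
    // Segments of the example input, written with Unicode escapes.
    System.out.println(bigramsOnly("\u4e00\u4e8c", "\u4e09"));
  }
}
```

With the segments 一二 and 三 this model produces [一二@0, 三@2], matching the position that 三 has when unigrams are emitted. With the segments 一二三, 四五 and 六 it produces [一二@0, 二三@1, 四五@3, 六@5], so the extra increment is applied once per boundary that follows a bigram.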

Changes

  • CJKBigramFilter.java: Added position tracking logic across CJK segment boundaries
  • TestCJKBigramFilter.java: Added 3 new test cases reproducing the bug; updated testHanOnly expected positions
  • TestWithCJKBigramFilter.java (ICU): Updated expected positions in testJa2, testMix, testMix2, testReusableTokenStream, and testFinalOffset
  • CHANGES.txt: Added bug fix entry

Test plan

  • All 15 CJKBigramFilter tests pass (including 3 new tests)
  • All 12 ICU TestWithCJKBigramFilter tests pass
  • Code formatting verified via ./gradlew tidy
  • testBigramPositionsConsistentAcrossWordBreak — reproduces exact scenario from issue
  • testBigramPositionsMultipleSegments — verifies across multiple CJK segments with breaks
  • testBigramPositionsBeforeNonCJK — verifies CJK bigram followed by non-CJK text

…ams disabled (apache#15812)

When outputUnigrams=false, CJKBigramFilter produced different token
positions compared to outputUnigrams=true. A bigram spans two character
positions but only advanced the position counter by 1. After a word
break (punctuation, whitespace, or non-CJK text), subsequent tokens
were assigned incorrect positions, breaking phrase queries in combined
unigram+bigram indexing strategies.

The fix tracks whether bigrams were emitted from the current CJK
segment and defers an extra position increment (+1) to apply to the
first token after a segment boundary. This ensures outputUnigrams=false
behaves "as if unigrams were emitted then removed", keeping positions
aligned across both settings.

Example: "一二、三"
  Before: 一二(pos0) 三(pos1) — wrong, positions don't match
  After:  一二(pos0) 三(pos2) — correct, matches outputUnigrams=true
github-actions bot added this to the 11.0.0 milestone Mar 14, 2026
@rmuir
Member

rmuir commented Mar 14, 2026

Looks great! I think CJKAnalyzer may use this filter, and now its position increments will have changed. Can you glance at the failing tests?

…mFilter behavior

CJKAnalyzer uses CJKBigramFilter with outputUnigrams=false, so its tests
need the same position increment updates applied to TestCJKBigramFilter
and TestWithCJKBigramFilter: after a CJK bigram segment boundary, the
next token now correctly gets positionIncrement=2 instead of 1.

Updates testJa2, testMix, testMix2, testReusableTokenStream, and
testFinalOffset.
@herley-shaori
Author

Looks great! I think CJKAnalyzer may use this filter, and now its position increments will have changed. Can you glance at the failing tests?

Done! All tests have passed.

termAtt.setLength(len);
offsetAtt.setOffset(startOffset[index], endOffset[index]);
typeAtt.setType(SINGLE_TYPE);
if (!outputUnigrams && deferredPosInc > 0) {
Member

this part of the if seems unnecessary, since deferredPosInc is never incremented unless !outputUnigrams.

Suggested change
- if (!outputUnigrams && deferredPosInc > 0) {
+ if (deferredPosInc > 0) {

Author

Thanks for the review! Applied your suggestion and extended the same reasoning to the other guards:

  • flushUnigram(): removed !outputUnigrams && (your suggestion)
  • flushBigram(): added if (deferredPosInc > 0) guard to skip the redundant setPositionIncrement(1) when
    clearAttributes() already defaults to 1
  • incrementToken() (both segment boundary checks): removed !outputUnigrams && before hadBigrams — since hadBigrams is only ever set true inside the !outputUnigrams branch of flushBigram(), the outer check is redundant.

Also fixed TestCJKAnalyzer (testJa2, testMix, testMix2, testReusableTokenStream, testFinalOffset) — same position increment updates needed since CJKAnalyzer uses CJKBigramFilter with outputUnigrams=false.
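
The redundancy argument can be checked on a reduced model. This is a hypothetical sketch, not the filter's code: `GuardEquivalence` and its method names are invented for illustration, and it assumes, as stated in the review, that the counter is only ever incremented when outputUnigrams is false.

```java
/** Hypothetical reduction of the guard discussion; not the actual CJKBigramFilter code. */
final class GuardEquivalence {
  private final boolean outputUnigrams;
  private int deferredPosInc = 0;

  GuardEquivalence(boolean outputUnigrams) {
    this.outputUnigrams = outputUnigrams;
  }

  /** Assumed to be the only place the counter grows, and only in no-unigram mode. */
  void onSegmentEndedWithBigrams() {
    if (!outputUnigrams) deferredPosInc++;
  }

  /** The guard as originally written in the reviewed line. */
  boolean originalGuard() {
    return !outputUnigrams && deferredPosInc > 0;
  }

  /** The guard after the suggested simplification. */
  boolean simplifiedGuard() {
    return deferredPosInc > 0;
  }

  /** Returns true if both guards agree in both modes before and after each of maxEvents increments. */
  static boolean guardsAgree(int maxEvents) {
    for (boolean mode : new boolean[] {true, false}) {
      GuardEquivalence g = new GuardEquivalence(mode);
      for (int i = 0; i <= maxEvents; i++) {
        if (g.originalGuard() != g.simplifiedGuard()) return false;
        g.onSegmentEndedWithBigrams();
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println("guards agree: " + guardsAgree(5));
  }
}
```

Under that assumption the counter stays at 0 whenever outputUnigrams is true, so the shorter condition selects exactly the same cases. The equivalence would stop holding if some later change incremented the counter in unigram mode.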

herley added 2 commits March 16, 2026 10:24
deferredPosInc is only ever incremented when !outputUnigrams, so the
extra condition is unnecessary. Suggested by rmuir in review.

hadBigrams is only ever set true inside the !outputUnigrams branch of
flushBigram(), so checking !outputUnigrams before testing hadBigrams is
redundant. Same reasoning applies to the deferredPosInc guard in
flushBigram(): clearAttributes() already defaults posInc to 1, so we
only need to call setPositionIncrement when deferredPosInc > 0.